{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d0d2e007",
   "metadata": {},
   "source": [
    "# Find Label Errors in Token Classification (Text) Datasets\n",
    "\n",
    "This 5-minute quickstart tutorial shows how you can use cleanlab to find potential label errors in text datasets for token classification. In token-classification, our data consists of a bunch of sentences (aka documents) in which every token (aka word) is labeled with one of K classes, and we train models to predict the class of each token in a new sentence. Example applications in NLP include part-of-speech-tagging or entity recognition, which is the focus on this tutorial. Here we use the [CoNLL-2003 named entity recognition](https://deepai.org/dataset/conll-2003-english) dataset which contains around 20,000 sentences with 300,000 individual tokens. Each token is labeled with one of the following classes:\n",
    "\n",
    "- LOC (location entity)\n",
    "- PER (person entity)\n",
    "- ORG (organization entity)\n",
    "- MISC (miscellaneous other type of entity)\n",
    "- O (other type of word that does not correspond to an entity)\n",
    "\n",
    "**Overview of what we'll do in this tutorial:** \n",
    "\n",
    "- Find tokens with label issues using `cleanlab.token_classification.filter.find_label_issues`. \n",
    "- Rank sentences based on their overall label quality using `cleanlab.token_classification.rank.get_label_quality_scores`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07936a54",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Quickstart\n",
    "<br/>\n",
    "    \n",
    "cleanlab uses three inputs to handle token classification data:\n",
    "\n",
    "- `tokens`: List whose `i`-th element is a list of strings/words corresponding to tokenized version of the `i`-th sentence in dataset. \n",
    "    Example: `[..., [\"I\", \"love\", \"cleanlab\"], ...]`\n",
    "- `labels`: List whose `i`-th element is a list of integers corresponding to class labels of each token in the `i`-th sentence. Example: `[..., [0, 0, 1], ...]`\n",
    "- `pred_probs`: List whose `i`-th element is a np.ndarray of shape `(N_i, K)` corresponding to predicted class probabilities for each token in the `i`-th sentence (assuming this sentence contains `N_i` tokens and dataset has `K` possible classes). These should be out-of-sample `pred_probs` obtained from a token classification model via cross-validation. \n",
    "    Example: `[..., np.array([[0.8,0.2], [0.9,0.1], [0.3,0.7]]), ...]`\n",
    "\n",
    "Using these, you can find/display label issues with this code: \n",
    "\n",
    "<div  class=markdown markdown=\"1\" style=\"background:white;margin:16px\">  \n",
    "    \n",
    "```python\n",
    "\n",
    "from cleanlab.token_classification.filter import find_label_issues \n",
    "from cleanlab.token_classification.summary import display_issues\n",
    "    \n",
    "issues = find_label_issues(labels, pred_probs)\n",
    "display_issues(issues, tokens, pred_probs=pred_probs, labels=labels,\n",
    "               class_names=OPTIONAL_LIST_OF_ORDERED_CLASS_NAMES)\n",
    "\n",
    "```\n",
    "    \n",
    "</div>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1da020bc",
   "metadata": {},
   "source": [
    "## 1. Install required dependencies and download data\n",
    "\n",
    "You can use `pip` to install all packages required for this tutorial as follows: \n",
    "\n",
    "    !pip install cleanlab "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae8a08e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget -nc https://data.deepai.org/conll2003.zip && mkdir data \n",
    "!unzip conll2003.zip -d data/ && rm conll2003.zip \n",
    "!wget -nc 'https://cleanlab-public.s3.amazonaws.com/TokenClassification/pred_probs.npz' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "439b0305",
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "# Package installation (hidden on docs website).\n",
    "\n",
    "dependencies = [\"cleanlab\"]\n",
    "\n",
    "if \"google.colab\" in str(get_ipython()):  # Check if it's running in Google Colab\n",
    "    %pip install cleanlab  # for colab\n",
    "    cmd = ' '.join([dep for dep in dependencies if dep != \"cleanlab\"])\n",
    "    %pip install $cmd\n",
    "else:\n",
    "    dependencies_test = [dependency.split('>')[0] if '>' in dependency \n",
    "                         else dependency.split('<')[0] if '<' in dependency \n",
    "                         else dependency.split('=')[0] for dependency in dependencies]\n",
    "    missing_dependencies = []\n",
    "    for dependency in dependencies_test:\n",
    "        try:\n",
    "            __import__(dependency)\n",
    "        except ImportError:\n",
    "            missing_dependencies.append(dependency)\n",
    "\n",
    "    if len(missing_dependencies) > 0:\n",
    "        print(\"Missing required dependencies:\")\n",
    "        print(*missing_dependencies, sep=\", \")\n",
    "        print(\"\\nPlease install them before running the rest of this notebook.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1349304",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from cleanlab.token_classification.filter import find_label_issues \n",
    "from cleanlab.token_classification.rank import get_label_quality_scores, issues_from_scores \n",
    "from cleanlab.internal.token_classification_utils import get_sentence, filter_sentence, mapping \n",
    "from cleanlab.token_classification.summary import display_issues, common_label_issues, filter_by_token \n",
    "\n",
    "np.set_printoptions(suppress=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ad75b45",
   "metadata": {},
   "source": [
    "## 2. Get data, labels, and pred_probs\n",
    "\n",
    "In token classification tasks, each token in the dataset is labeled with one of *K* possible classes.\n",
    "To find label issues, cleanlab requires predicted class probabilities from a trained classifier. These `pred_probs` contain a length-*K* vector for **each** token in the dataset (which sums to 1 for each token).  Here we use `pred_probs` which are out-of-sample predicted class probabilities for the full CoNLL-2003 dataset (merging training, development, and testing splits), obtained from a BERT Transformer fit via cross-validation. Our example notebook [\"Training Entity Recognition Model for Token Classification\"](https://github.com/cleanlab/examples/blob/master/entity_recognition/entity_recognition_training.ipynb) contains the code to produce such `pred_probs` and save them in a `.npz` file, which we simply load here via a `read_npz` function (can skip these details)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6cc832fd",
   "metadata": {},
   "source": [
    "<details><summary>See the code for reading the `.npz` file **(click to expand)**</summary> \n",
    "\n",
    "```python\n",
    "# Note: This pulldown content is for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.\n",
    "\n",
    "def read_npz(filepath): \n",
    "    data = dict(np.load(filepath)) \n",
    "    data = [data[str(i)] for i in range(len(data))] \n",
    "    return data \n",
    "\n",
    "```\n",
    "</details>"
   ]
  },
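  {
   "cell_type": "markdown",
   "id": "e5b0c3a1",
   "metadata": {},
   "source": [
    "The `read_npz` function above expects the `.npz` archive to hold one array per sentence, stored under the string keys `0`, `1`, `2`, and so on. As a minimal sketch of that layout (the `write_npz` helper and the toy arrays below are hypothetical illustrations, not part of cleanlab or of the linked training notebook), a compatible file could be written and read back roughly like this:\n",
    "\n",
    "```python\n",
    "import os\n",
    "import tempfile\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "def write_npz(filepath, arrays):\n",
    "    # Hypothetical helper: store the i-th array under the key str(i).\n",
    "    np.savez(filepath, **{str(i): arr for i, arr in enumerate(arrays)})\n",
    "\n",
    "def read_npz(filepath):\n",
    "    # Mirrors the tutorial's reader: rebuild the list in index order.\n",
    "    with np.load(filepath) as data:\n",
    "        return [data[str(i)] for i in range(len(data.files))]\n",
    "\n",
    "# Two toy sentences with 2 and 3 tokens, and K = 2 classes.\n",
    "toy_pred_probs = [\n",
    "    np.array([[0.9, 0.1], [0.2, 0.8]]),\n",
    "    np.array([[0.6, 0.4], [0.5, 0.5], [0.1, 0.9]]),\n",
    "]\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmpdir:\n",
    "    path = os.path.join(tmpdir, 'toy_pred_probs.npz')\n",
    "    write_npz(path, toy_pred_probs)\n",
    "    restored = read_npz(path)\n",
    "\n",
    "print([p.shape for p in restored])\n",
    "```\n",
    "\n",
    "Keeping one array per sentence lets each array have its own number of rows `N_i`, which a single rectangular array could not represent without padding."
   ]
  },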
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ab9d59a0",
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "def read_npz(filepath): \n",
    "    data = dict(np.load(filepath)) \n",
    "    data = [data[str(i)] for i in range(len(data))] \n",
    "    return data "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "519cb80c",
   "metadata": {},
   "outputs": [],
   "source": [
    "pred_probs = read_npz('pred_probs.npz') "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a8136f37",
   "metadata": {},
   "source": [
    "`pred_probs` is a list of numpy arrays, which we'll describe later. Let's first also load the dataset and its labels. We collect sentences from the original text files defining: \n",
    "\n",
    "- `tokens` as a nested list where `tokens[i]` is a list of strings corrsesponding to a (word-level) tokenized version of the `i`-th sentence\n",
    "- `given_labels` as a nested list of the given labels in the dataset where `given_labels[i]` is a list of labels for each token in the `i`-th sentence. \n",
    "\n",
    "This version of CoNLL-2003 uses IOB2-formatting for tagging, where `B-` and `I-` prefixes in the class labels indicate whether the tokens are at the start of an entity or in the middle. We ignore these distinctions in this tutorial (as label errors that confuse `B-` and `I-` are less interesting), and thus have two sets of entities: \n",
    "\n",
    "- `given_entities` = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']  \n",
    "- `entities` = ['O', 'MISC', 'PER', 'ORG', 'LOC']. These are our classes of interest for the token classification task.\n",
    "\n",
    "We use some helper methods to load the CoNLL data (can skip these details)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43a87745",
   "metadata": {},
   "source": [
    "<details><summary>See the code for reading the CoNLL data files **(click to expand)**</summary>\n",
    "\n",
    "```python\n",
    "\n",
    "# Note: This pulldown content is for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.\n",
    "\n",
    "given_entities = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']\n",
    "entities = ['O', 'MISC', 'PER', 'ORG', 'LOC'] \n",
    "entity_map = {entity: i for i, entity in enumerate(given_entities)} \n",
    "\n",
    "def readfile(filepath, sep=' '): \n",
    "    lines = open(filepath)\n",
    "    data, sentence, label = [], [], []\n",
    "    for line in lines:\n",
    "        if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == '\\n':\n",
    "            if len(sentence) > 0:\n",
    "                data.append((sentence, label))\n",
    "                sentence, label = [], []\n",
    "            continue\n",
    "        splits = line.split(sep) \n",
    "        word = splits[0]\n",
    "        if len(word) > 0 and word[0].isalpha() and word.isupper():\n",
    "            word = word[0] + word[1:].lower()\n",
    "        sentence.append(word)\n",
    "        label.append(entity_map[splits[-1][:-1]])\n",
    "\n",
    "    if len(sentence) > 0:\n",
    "        data.append((sentence, label))\n",
    "\n",
    "    tokens = [d[0] for d in data] \n",
    "    given_labels = [d[1] for d in data]\n",
    "    return tokens, given_labels\n",
    "\n",
    "```\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "202f1526",
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "given_entities = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']\n",
    "entities = ['O', 'MISC', 'PER', 'ORG', 'LOC'] \n",
    "entity_map = {entity: i for i, entity in enumerate(given_entities)} \n",
    "\n",
    "def readfile(filepath, sep=' '): \n",
    "    lines = open(filepath)\n",
    "    data, sentence, label = [], [], []\n",
    "    for line in lines:\n",
    "        if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == '\\n':\n",
    "            if len(sentence) > 0:\n",
    "                data.append((sentence, label))\n",
    "                sentence, label = [], []\n",
    "            continue\n",
    "        splits = line.split(sep) \n",
    "        word = splits[0]\n",
    "        if len(word) > 0 and word[0].isalpha() and word.isupper():\n",
    "            word = word[0] + word[1:].lower()\n",
    "        sentence.append(word)\n",
    "        label.append(entity_map[splits[-1][:-1]])\n",
    "\n",
    "    if len(sentence) > 0:\n",
    "        data.append((sentence, label))\n",
    "        \n",
    "    tokens = [d[0] for d in data] \n",
    "    given_labels = [d[1] for d in data] \n",
    "    return tokens, given_labels "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a4381f03",
   "metadata": {},
   "outputs": [],
   "source": [
    "filepaths = ['data/train.txt', 'data/valid.txt', 'data/test.txt'] \n",
    "tokens, given_labels = [], [] \n",
    "\n",
    "for filepath in filepaths: \n",
    "    words, label = readfile(filepath) \n",
    "    tokens.extend(words) \n",
    "    given_labels.extend(label)\n",
    "    \n",
    "sentences = list(map(get_sentence, tokens)) \n",
    "\n",
    "sentences, mask = filter_sentence(sentences) \n",
    "tokens = [words for m, words in zip(mask, tokens) if m] \n",
    "given_labels = [labels for m, labels in zip(mask, given_labels) if m] \n",
    "\n",
    "maps = [0, 1, 1, 2, 2, 3, 3, 4, 4] \n",
    "labels = [mapping(labels, maps) for labels in given_labels] "
   ]
  },
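  {
   "cell_type": "markdown",
   "id": "b7e1a2c4",
   "metadata": {},
   "source": [
    "The `maps` list used above sends each of the nine IOB2 tags in `given_entities` to one of the five merged classes in `entities`. The sketch below shows one way such a list could be derived and applied. It assumes the merge is a plain per-label lookup; `merge_iob2` is a hypothetical stand-in written for illustration, not cleanlab's `mapping` helper itself.\n",
    "\n",
    "```python\n",
    "given_entities = ['O', 'B-MISC', 'I-MISC', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']\n",
    "entities = ['O', 'MISC', 'PER', 'ORG', 'LOC']\n",
    "\n",
    "# maps[k] is the merged class index of the k-th IOB2 tag:\n",
    "# strip the B-/I- prefix, then look the remaining name up in entities.\n",
    "maps = [entities.index(tag.split('-')[-1]) for tag in given_entities]\n",
    "\n",
    "def merge_iob2(sentence_labels, maps):\n",
    "    # Hypothetical stand-in: replace every IOB2 label by its merged class.\n",
    "    return [maps[label] for label in sentence_labels]\n",
    "\n",
    "toy_given = [3, 4, 0, 7]  # B-PER, I-PER, O, B-LOC\n",
    "toy_merged = merge_iob2(toy_given, maps)\n",
    "\n",
    "print(maps)\n",
    "print([entities[k] for k in toy_merged])\n",
    "```\n",
    "\n",
    "Deriving the list from the tag names avoids hand-typing the indices, at the cost of assuming every tag is either `O` or carries a `B-`/`I-` prefix."
   ]
  },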
  {
   "cell_type": "markdown",
   "id": "46cb7c93",
   "metadata": {},
   "source": [
    "To find label issues in token classification data, cleanlab requires `labels` and `pred_probs`, which should look as follows: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7842e4a3",
   "metadata": {},
   "outputs": [],
   "source": [
    "indices_to_preview = 3  # increase this to view more examples\n",
    "for i in range(indices_to_preview):\n",
    "    print('\\nsentences[%d]:\\t' % i + str(sentences[i])) \n",
    "    print('labels[%d]:\\t' % i + str(labels[i])) \n",
    "    print('pred_probs[%d]:\\n' % i + str(pred_probs[i])) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b71eb4a",
   "metadata": {},
   "source": [
    "Note that these correspond to the sentences in the dataset, where each sentence is treated as an individual training example (could be document instead of sentence).  If using your own dataset, both `pred_probs` and `labels` should each be formatted as a nested-list where: \n",
    "\n",
    "- `pred_probs` is a list whose `i`-th element is a np.ndarray of shape `(N_i, K)` corresponding to predicted class probabilities for each token in the `i`-th sentence (assuming this sentence contains `N_i` tokens and dataset has `K` possible classes). Each row of one np.ndarray corresponds to a token `t` and contains a model's predicted probability  that `t` belongs to each possible class, for each of the K classes. The columns must be ordered such that the probabilities correspond to class 0, 1, ..., K-1. These should be out-of-sample `pred_probs` obtained from a token classification model via cross-validation. \n",
    "\n",
    "- `labels` is a list whose `i`-th element is a list of integers corresponding to class label of each token in the `i`-th sentence. For dataset with K classes, labels must take values in 0, 1, ..., K-1. "
   ]
  },
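  {
   "cell_type": "markdown",
   "id": "c9d4f6a2",
   "metadata": {},
   "source": [
    "Before passing your own data to cleanlab, it can help to check these format requirements explicitly. The following is a minimal sketch of such a check. `check_token_inputs` is a hypothetical helper written for this illustration (it is not part of cleanlab), and the tolerance used for the row sums is an arbitrary choice.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def check_token_inputs(labels, pred_probs):\n",
    "    # Hypothetical sanity check (not part of cleanlab); returns the number of classes K.\n",
    "    if len(labels) != len(pred_probs):\n",
    "        raise ValueError('labels and pred_probs must cover the same number of sentences')\n",
    "    num_classes = pred_probs[0].shape[1]\n",
    "    for i, (sentence_labels, probs) in enumerate(zip(labels, pred_probs)):\n",
    "        # One row per token, one column per class.\n",
    "        expected_shape = (len(sentence_labels), num_classes)\n",
    "        if probs.shape != expected_shape:\n",
    "            raise ValueError(f'sentence {i}: expected shape {expected_shape}, got {probs.shape}')\n",
    "        # Each token's probabilities should sum to 1.\n",
    "        if not np.allclose(probs.sum(axis=1), 1.0, atol=1e-4):\n",
    "            raise ValueError(f'sentence {i}: each row of pred_probs should sum to 1')\n",
    "        # Labels must be valid column indices.\n",
    "        if any(not 0 <= label < num_classes for label in sentence_labels):\n",
    "            raise ValueError(f'sentence {i}: labels must lie in 0, ..., {num_classes - 1}')\n",
    "    return num_classes\n",
    "\n",
    "toy_labels = [[0, 0, 1], [1, 0]]\n",
    "toy_pred_probs = [\n",
    "    np.array([[0.8, 0.2], [0.9, 0.1], [0.3, 0.7]]),\n",
    "    np.array([[0.4, 0.6], [0.7, 0.3]]),\n",
    "]\n",
    "num_classes = check_token_inputs(toy_labels, toy_pred_probs)\n",
    "print(num_classes)\n",
    "```\n",
    "\n",
    "A check like this cannot tell whether the probabilities are truly out-of-sample, so that requirement still has to be ensured by how the model was trained."
   ]
  },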
  {
   "cell_type": "markdown",
   "id": "1dc3150f",
   "metadata": {},
   "source": [
    "## 3. Use cleanlab to find label issues \n",
    "\n",
    "Based on the given labels and out-of-sample predicted probabilities, cleanlab can quickly help us identify label issues in our dataset. Here we request that the indices of the identified label issues be sorted by cleanlab’s self-confidence score, which measures the quality of each given label via the probability assigned to it in our model’s prediction. The returned `issues` are a list of tuples `(i, j)`, which corresponds to the `j`th token of the `i`-th sentence in the dataset. These are the tokens cleanlab thinks may be badly labeled in your dataset. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c2ad9ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "issues = find_label_issues(labels, pred_probs) "
   ]
  },
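  {
   "cell_type": "markdown",
   "id": "d2a8e7b3",
   "metadata": {},
   "source": [
    "To make the self-confidence score concrete: for a token with given label `k`, it is the predicted probability in column `k` of that token's row. The sketch below ranks every token of a toy dataset by this score, lowest first. It only illustrates the ordering criterion; `rank_tokens_by_self_confidence` is a hypothetical helper, and unlike `find_label_issues` it makes no attempt to decide which tokens actually count as issues.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def rank_tokens_by_self_confidence(labels, pred_probs):\n",
    "    # Hypothetical helper: returns all (i, j) token indices, least confident first.\n",
    "    scored = []\n",
    "    for i, (sentence_labels, probs) in enumerate(zip(labels, pred_probs)):\n",
    "        for j, label in enumerate(sentence_labels):\n",
    "            # Self-confidence: probability the model assigns to the given label.\n",
    "            scored.append((float(probs[j, label]), (i, j)))\n",
    "    scored.sort(key=lambda pair: pair[0])\n",
    "    return [index for _, index in scored]\n",
    "\n",
    "toy_labels = [[0, 0, 1], [1, 0]]\n",
    "toy_pred_probs = [\n",
    "    np.array([[0.8, 0.2], [0.9, 0.1], [0.3, 0.7]]),\n",
    "    np.array([[0.95, 0.05], [0.6, 0.4]]),\n",
    "]\n",
    "ranking = rank_tokens_by_self_confidence(toy_labels, toy_pred_probs)\n",
    "print(ranking[:2])\n",
    "```\n",
    "\n",
    "In this toy data the first token of the second sentence ranks first, because its given label receives only 0.05 probability."
   ]
  },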
  {
   "cell_type": "markdown",
   "id": "7221c12b",
   "metadata": {},
   "source": [
    "Let's look at the top 20 tokens that cleanlab thinks are most likely mislabeled. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "95dc7268",
   "metadata": {},
   "outputs": [],
   "source": [
    "top = 20  # increase this value to view more identified issues\n",
    "print('Cleanlab found %d potential label issues. ' % len(issues)) \n",
    "print('The top %d most likely label errors:' % top) \n",
    "print(issues[:top]) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65421a2d",
   "metadata": {},
   "source": [
    "We can better decide how to handle these issues by viewing the original sentences containing these tokens.\n",
    "Given that `O` and `MISC` classes (corresponding to integers 0 and 1 in our class ordering) can sometimes be ambiguous, they are excluded from our visualization below. This is achieved via the `exclude` argument, a list of tuples `(i, j)` such that tokens predicted as `entities[j]` but labeled as `entities[i]` are ignored."
   ]
  },
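  {
   "cell_type": "markdown",
   "id": "a4f7b9c5",
   "metadata": {},
   "source": [
    "The effect of `exclude` can be pictured as a filter over (given class, predicted class) pairs. The sketch below follows the description above under two assumptions: the predicted class is taken to be the argmax of the token's `pred_probs` row, and `drop_excluded` is a hypothetical helper written for illustration rather than cleanlab's own code.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def drop_excluded(issues, labels, pred_probs, exclude):\n",
    "    # Hypothetical helper: keep only issues whose (given, predicted) class pair\n",
    "    # is not listed in exclude.\n",
    "    excluded_pairs = set(exclude)\n",
    "    kept = []\n",
    "    for sentence_idx, token_idx in issues:\n",
    "        given = labels[sentence_idx][token_idx]\n",
    "        predicted = int(np.argmax(pred_probs[sentence_idx][token_idx]))\n",
    "        if (given, predicted) not in excluded_pairs:\n",
    "            kept.append((sentence_idx, token_idx))\n",
    "    return kept\n",
    "\n",
    "# Toy data with three classes: 0 = O, 1 = MISC, 2 = PER.\n",
    "toy_labels = [[0, 2], [1]]\n",
    "toy_pred_probs = [\n",
    "    np.array([[0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]),\n",
    "    np.array([[0.9, 0.05, 0.05]]),\n",
    "]\n",
    "toy_issues = [(0, 0), (0, 1), (1, 0)]\n",
    "\n",
    "kept = drop_excluded(toy_issues, toy_labels, toy_pred_probs, exclude=[(0, 1), (1, 0)])\n",
    "print(kept)\n",
    "```\n",
    "\n",
    "Only the token labeled `PER` but predicted `O` survives here, since the other two issues are `O`/`MISC` confusions."
   ]
  },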
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e13de188",
   "metadata": {},
   "outputs": [],
   "source": [
    "display_issues(issues, tokens, pred_probs=pred_probs, labels=labels, \n",
    "               exclude=[(0, 1), (1, 0)], class_names=entities) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96d04902",
   "metadata": {},
   "source": [
    "More than half of the potential label issues correspond to tokens that are incorrectly labeled. As shown above, some examples are ambigious and may require more thoughful handling. cleanlab has also discovered some edge cases such as tokens which are simply punctuations such as `/` and `(`. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d213b2b2",
   "metadata": {},
   "source": [
    "### Most common word-level token mislabels \n",
    "\n",
    "We may also wish to understand which tokens tend to be most commonly mislabeled throughout the entire dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4a006bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "info = common_label_issues(issues, tokens, \n",
    "                           labels=labels, \n",
    "                           pred_probs=pred_probs, \n",
    "                           class_names=entities, \n",
    "                           exclude=[(0, 1), (1, 0)]) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c417061",
   "metadata": {},
   "source": [
    "The printed information above is also stored in pd.DataFrame `info`."
   ]
  },
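  {
   "cell_type": "markdown",
   "id": "f1c6d8e4",
   "metadata": {},
   "source": [
    "The basic idea of counting how often each word appears among the flagged tokens can be sketched with the standard library alone. This is a simplified illustration on hypothetical toy data; `common_label_issues` reports more detail than this plain count.\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "toy_tokens = [\n",
    "    ['United', 'beat', 'Leeds'],\n",
    "    ['United', 'Nations', 'met'],\n",
    "    ['Leeds', 'won'],\n",
    "]\n",
    "# Each issue is (sentence index, token index), as returned by find_label_issues.\n",
    "toy_issues = [(0, 0), (1, 0), (2, 0)]\n",
    "\n",
    "# Look up the word behind every flagged token and count repeats.\n",
    "counts = Counter(toy_tokens[i][j] for i, j in toy_issues)\n",
    "print(counts.most_common(2))\n",
    "```\n",
    "\n",
    "Words that are flagged many times are good candidates for a systematic annotation problem rather than a one-off mistake."
   ]
  },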
  {
   "cell_type": "markdown",
   "id": "a35ef843",
   "metadata": {},
   "source": [
    "### Find sentences containing a particular mislabeled word \n",
    "\n",
    "You can also only focus on the subset of potentially problematic sentences where a particular token may have been mislabeled."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c8f4e163",
   "metadata": {},
   "outputs": [],
   "source": [
    "token_issues = filter_by_token('United', issues, tokens)\n",
    "\n",
    "display_issues(token_issues, tokens, pred_probs=pred_probs, labels=labels, \n",
    "               exclude=[(0, 1), (1, 0)], class_names=entities) "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1759108b",
   "metadata": {},
   "source": [
    "###  Sentence label quality score \n",
    "\n",
    "For best reviewing label issues in a token classification dataset, you want to look at sentences one at a time. Here sentences more likely to contain a label error should be ranked earlier. Cleanlab can provide an overall label quality score for each sentence (ranging from 0 to 1) such that lower scores indicate sentences more likely to contain some mislabeled token. We can also obtain label quality scores for each individual token and manually decide which of these are label issues by thresholding them. For automatically estimating which tokens are mislabeled (and the number of label errors), you should use `find_label_issues()` instead. `get_label_quality_scores()` is useful if you only have time to review a few sentences and want to prioritize which, or if you're specifically aiming to detect label errors with high precision (or high recall) rather than overall estimation of the set of mislabeled tokens."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "db0b5179",
   "metadata": {},
   "outputs": [],
   "source": [
    "sentence_scores, token_scores = get_label_quality_scores(labels, pred_probs)\n",
    "issues = issues_from_scores(sentence_scores, token_scores=token_scores) \n",
    "display_issues(issues, tokens, pred_probs=pred_probs, labels=labels, \n",
    "               exclude=[(0, 1), (1, 0)], class_names=entities) "
   ]
  },
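  {
   "cell_type": "markdown",
   "id": "b3e9a1d6",
   "metadata": {},
   "source": [
    "To make the relationship between token scores, sentence scores, and thresholding concrete, here is a minimal sketch on toy data. It assumes self-confidence as the token score and the worst (minimum) token score as the sentence score. Both the `toy_quality_scores` helper and the 0.5 threshold are hypothetical choices for illustration, and cleanlab's own scoring options may differ.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def toy_quality_scores(labels, pred_probs):\n",
    "    # Hypothetical helper. Token score: probability assigned to the given label.\n",
    "    token_scores = [\n",
    "        np.array([probs[j, label] for j, label in enumerate(sentence_labels)])\n",
    "        for sentence_labels, probs in zip(labels, pred_probs)\n",
    "    ]\n",
    "    # Sentence score (assumed aggregation): the worst token score in the sentence.\n",
    "    sentence_scores = np.array([scores.min() for scores in token_scores])\n",
    "    return sentence_scores, token_scores\n",
    "\n",
    "toy_labels = [[0, 0, 1], [1, 0]]\n",
    "toy_pred_probs = [\n",
    "    np.array([[0.8, 0.2], [0.9, 0.1], [0.3, 0.7]]),\n",
    "    np.array([[0.95, 0.05], [0.6, 0.4]]),\n",
    "]\n",
    "sentence_scores, token_scores = toy_quality_scores(toy_labels, toy_pred_probs)\n",
    "\n",
    "# Review sentences from lowest to highest quality score.\n",
    "review_order = np.argsort(sentence_scores)\n",
    "\n",
    "# Manually flag tokens whose score falls below a chosen threshold.\n",
    "threshold = 0.5\n",
    "flagged = [\n",
    "    (i, j)\n",
    "    for i, scores in enumerate(token_scores)\n",
    "    for j, score in enumerate(scores)\n",
    "    if score < threshold\n",
    "]\n",
    "print(review_order.tolist(), flagged)\n",
    "```\n",
    "\n",
    "Lowering the threshold flags fewer tokens with higher precision, while raising it flags more tokens with higher recall."
   ]
  },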
  {
   "cell_type": "markdown",
   "id": "1759108c",
   "metadata": {},
   "source": [
    "## How does cleanlab.token_classification work?\n",
    "\n",
    "The underlying algorithms used to produce these scores are described in [this paper](https://arxiv.org/abs/2210.03920)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a18795eb",
   "metadata": {
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "# Note: This cell is only for docs.cleanlab.ai, if running on local Jupyter or Colab, please ignore it.\n",
    "highlighted_indices = [(2907, 0), (19392, 0), (9962, 4), (8904, 30), (19303, 0), \n",
    "                       (12918, 0), (9256, 0), (11855, 20), (18392, 4), (20426, 28), \n",
    "                       (19402, 21), (14744, 15), (19371, 0), (4645, 2), (83, 9), \n",
    "                       (10331, 3), (9430, 10), (6143, 25), (18367, 0), (12914, 3)] \n",
    "\n",
    "if not all(x in issues for x in highlighted_indices):\n",
    "    raise Exception(\"Some highlighted examples are missing from ranked_label_issues.\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
